Summarizing Key Concepts using Citation Sentences
Abstract
Citations have great potential to be a valuable resource in mining the bioscience literature (Nakov et al., 2004). The text around citations (or citances) tends to state biological facts with reference to the original papers that discovered them. The cited facts are typically stated more concisely in the citing papers than in the original. We hypothesize that in many cases, as time goes by, the citation sentences can indicate the most important contributions of a paper more accurately than its original abstract. One can use various NLP tools to identify and normalize the important entities in (a) the abstract of the original article, (b) the body of the original article, and (c) the citances to the article. We hypothesize that grouping entities by their occurrence in the citances yields a better summary of the original paper than using only the first two sources of information. To help determine the utility of the approach, we are applying it to the problem of identifying articles that discuss critical residue functionality, for use in PhyloFacts, a phylogenomic database (Sjolander, 2004). Consider the article shown in Figure 1. This prominent paper was published in 1992, and nearly 500 papers cite it. For about 200 of these papers, we downloaded the sentences that surround the citation within the full text. Some examples are shown in Figure 2. We are developing a statistical model that will cluster the entities found in these citances into potentially overlapping groups, where each group represents a central idea in the original paper. In the example shown, some of the citances emphasize what the paper reports about the structural elements of the SH2 domain, whereas others emphasize its findings on interactions and still others focus on the critical residues. Often several articles are cited in the same citance, so it is important to untangle which entities belong to which citation; by pursuing overlapping sets, our model should be able to eliminate most spurious references.
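The grouping idea described above can be illustrated with a minimal sketch. It is not the statistical model the abstract refers to; it swaps in a plain co-occurrence count as a stand-in. The function names, the `min_support` threshold, and the toy citances (including the placeholder entities `entity_x`, `entity_y`, `entity_z`) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def entity_support(citances):
    """Count, for each entity, how many citances mention it."""
    counts = Counter()
    for entities in citances:
        counts.update(set(entities))
    return counts


def cooccurrence_groups(citances, min_support=2):
    """Build one group per entity: the entity plus every partner it
    co-occurs with in at least `min_support` citances.

    Groups may overlap, and pairs seen fewer than `min_support` times
    are dropped, which is a crude way to discard entities that belong
    to another paper cited in the same sentence.
    """
    pair_counts = Counter()
    for entities in citances:
        for a, b in combinations(sorted(set(entities)), 2):
            pair_counts[(a, b)] += 1

    groups = {}
    for (a, b), n in pair_counts.items():
        if n >= min_support:
            groups.setdefault(a, {a}).add(b)
            groups.setdefault(b, {b}).add(a)
    return groups


# Hypothetical input: each citance is reduced to the set of
# (already normalized) entities it mentions. Invented toy data.
citances = [
    {"SH2 domain", "entity_x", "entity_y"},
    {"SH2 domain", "entity_x"},
    {"entity_y", "entity_z"},
    {"SH2 domain", "entity_y", "entity_z"},
]

support = entity_support(citances)
groups = cooccurrence_groups(citances, min_support=2)
```

In this toy input, `entity_x` and `entity_y` appear together only once, so they end up in different groups even though both are grouped with `SH2 domain`. A real model would presumably replace the hard threshold with a probabilistic criterion.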
The same entity is often described in many different ways. Prior work has shown how to use redundant information across citations to help normalize entities (Wellner et al., 2004; Pasula et al., 2003); similar techniques may work with entities mentioned in citances. This can be combined with prior work on normalizing entity names in bioscience text, e.g., Morgan et al. (2004). For a detailed review of related work, see Nakov et al. (2004). By emphasizing entities, the model potentially misses important relationships between them. It remains to be determined whether relationships must be modeled explicitly in order to create a useful summary.
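As a rough illustration of why normalization matters before grouping, the sketch below merges surface variants of a mention under a shared key. It is a simple string-canonicalization stand-in, not the methods of the cited prior work; `normalize_mention`, `group_variants`, and the sample mentions are hypothetical.

```python
import re
from collections import defaultdict


def normalize_mention(mention):
    """Lowercase the mention and collapse every run of
    non-alphanumeric characters into a single space."""
    key = re.sub(r"[^a-z0-9]+", " ", mention.lower())
    return key.strip()


def group_variants(mentions):
    """Map each normalized key to the set of surface forms seen for it."""
    variants = defaultdict(set)
    for mention in mentions:
        variants[normalize_mention(mention)].add(mention)
    return dict(variants)


# Invented toy mentions; "entity_x" is a placeholder name.
mentions = ["SH2 domain", "SH2-domain", "Sh2  Domain", "entity_x"]
variants = group_variants(mentions)
```

String canonicalization of this kind only catches spelling and punctuation variants. Synonyms and abbreviations would need the redundancy-based techniques the abstract points to.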